1 Introduction
- Overview and Motivation
- Related Work
- Research questions
2 Testing that R and Python work
#> [1] "hello"
#> 30.0
3 Data
- Sources
- Description
- Wrangling/cleaning
- Spotting mistakes and missing data (could be part of EDA too)
- Listing anomalies and outliers (could be part of EDA too)
3.1 Main dataset cleaning
#>     price number_of_rooms address            canton property_type floor year_category
#> 1 1800000              65 1844 Villeneuve VD Vaud   Apartment     eg    0-1919
#> 2 1980000              55 1820 Montreux      Vaud   Apartment     eg    0-1919
#> 3  488000              35 1882 Gryon         Vaud   Apartment     eg    0-1919
#> 4 1755000               7 1820 Montreux      Vaud   Apartment     eg    0-1919
#> 5  650000              25 1815 Clarens       Vaud   Apartment     eg    0-1919
#> 6 1490000              45 1260 Nyon          Vaud   Apartment     eg    0-1919
3.2 Creating the zip_code variable and merging with AMTOVZ_CSV_LV95
#>     price number_of_rooms address            canton property_type floor year_category zip_code
#> 1 1800000              65 1844 Villeneuve VD Vaud   Apartment     eg    0-1919            1844
#> 2 1980000              55 1820 Montreux      Vaud   Apartment     eg    0-1919            1820
#> 3  488000              35 1882 Gryon         Vaud   Apartment     eg    0-1919            1882
#> 4 1755000               7 1820 Montreux      Vaud   Apartment     eg    0-1919            1820
#> 5  650000              25 1815 Clarens       Vaud   Apartment     eg    0-1919            1815
#> 6 1490000              45 1260 Nyon          Vaud   Apartment     eg    0-1919            1260
#>   Ortschaftsname      PLZ Zusatzziffer Gemeindename       BFS.Nr Kantonskürzel       E       N Sprache Validity
#> 1 Aeugst am Albis    8914            0 Aeugst am Albis         1 ZH            2679403 1235842 de      2008-07-01
#> 2 Aeugstertal        8914            2 Aeugst am Albis         1 ZH            2679815 1237404 de      2008-07-01
#> 3 Zwillikon          8909            0 Affoltern am Albis      2 ZH            2675280 1238108 de      2008-07-01
#> 4 Affoltern am Albis 8910            0 Affoltern am Albis      2 ZH            2676852 1236930 de      2008-07-01
#> 5 Bonstetten         8906            0 Bonstetten              3 ZH            2677412 1241078 de      2008-07-01
#> 6 Sihlbrugg          6340            4 Hausen am Albis         4 ZH            2686082 1230649 de      2008-07-01
#> City zip_code Canton_code
#> 1 Aeugst am Albis 8914 ZH
#> 2 Aeugstertal 8914 ZH
#> 3 Zwillikon 8909 ZH
#> 4 Affoltern am Albis 8910 ZH
#> 5 Bonstetten 8906 ZH
#> 6 Sihlbrugg 6340 ZH
#>       zip_code    price number_of_rooms address                                            canton       property_type    floor year_category City Canton_code
#>     1       25  2200000              10 1000 Lausanne 25                                   Vaud         Single house           1919-1945     <NA> <NA>
#>     2       25  2200000              65 1000 Lausanne 25                                   Vaud         Villa                  2006-2010     <NA> <NA>
#>     3       26  1995000              75 Lausanne 26, 1000 Lausanne 26                      Vaud         Villa                  1961-1970     <NA> <NA>
#>     4       26   870490              45 1000 Lausanne 26                                   Vaud         Apartment        noteg 2016-2024     <NA> <NA>
#>     5      322   870000              25 Via Cuolm Liung 30d, 7032 Laax GR 2                Grisons      Apartment        eg    2016-2024     <NA> <NA>
#>     6      322  1295770              45 Via Murschetg 29, 7032 Laax GR 2                   Grisons      Apartment        noteg 2011-2015     <NA> <NA>
#>  2253     1200  2450000               6 1200 Genève                                        Geneva       Bifamiliar house       1981-1990     <NA> <NA>
#>  2254     1200   982130              45 Chemin des pralets, 74100 Etrembières, 1200 Genève Geneva       Bifamiliar house       2016-2024     <NA> <NA>
#> 11886     1919  2535730              55 1919 Martigny                                      Valais       Attic flat       noteg 2016-2024     <NA> <NA>
#> 11887     1919   230000              15 1919 Martigny                                      Valais       Apartment        eg    2016-2024     <NA> <NA>
#> 11888     1919  1415380              35 1919 Martigny                                      Valais       Apartment        noteg 2016-2024     <NA> <NA>
#> 11889     1919  1043260              45 1919 Martigny                                      Valais       Apartment        noteg 2016-2024     <NA> <NA>
#> 11890     1919  2535730              55 1919 Martigny                                      Valais       Apartment        noteg 2016-2024     <NA> <NA>
#> 17993     2500  1050000              45 Hohlenweg 11b, 2500 Biel/Bienne                    Bern         Single house           2001-2005     <NA> <NA>
#> 17994     2500  1100000               5 2500 Biel/Bienne                                   Bern         Single house           2001-2005     <NA> <NA>
#> 17995     2500   887500              55 2500 Biel/Bienne                                   Bern         Single house           2016-2024     <NA> <NA>
#> 17996     2500   870500              45 2500 Biel/Bienne                                   Bern         Single house           2016-2024     <NA> <NA>
#> 17997     2500  1176820              45 2500 Biel/Bienne                                   Bern         Villa                  2016-2024     <NA> <NA>
#> 17998     2500  1159550              35 2500 Biel/Bienne                                   Bern         Villa                  2016-2024     <NA> <NA>
#> 17999     2500  1927050              45 2500 Bienne                                        Bern         Single house           2016-2024     <NA> <NA>
#> 18000     2500   892500              45 2500 Biel/Bienne                                   Bern         Single house           2016-2024     <NA> <NA>
#> 18001     2500   887500              45 2500 Biel/Bienne                                   Bern         Single house           2016-2024     <NA> <NA>
#> 18002     2500   420000              45 2500 Biel/Bienne                                   Bern         Apartment        noteg 1971-1980     <NA> <NA>
#> 18003     2500   877500              45 2500 Biel/Bienne                                   Bern         Single house           2016-2024     <NA> <NA>
#> 18004     2500   885500              55 2500 Biel/Bienne                                   Bern         Single house           2016-2024     <NA> <NA>
#> 18005     2500   872500              45 2500 Biel/Bienne                                   Bern         Single house           2016-2024     <NA> <NA>
#> 19603     3000  1448610              45 3000 Bern                                          Bern         Apartment        eg    2016-2024     <NA> <NA>
#> 19604     3000  1515060              45 3000 Bern                                          Bern         Apartment        eg    2016-2024     <NA> <NA>
#> 19605     3000   956880              45 3000 Bern                                          Bern         Apartment        eg    2016-2024     <NA> <NA>
#> 19606     3000  1222680              35 3000 Bern                                          Bern         Apartment        noteg 2016-2024     <NA> <NA>
#> 19607     3000  1448610              45 3000 Bern                                          Bern         Apartment        noteg 2016-2024     <NA> <NA>
#> 19608     3000  1448610              45 3000 Bern                                          Bern         Apartment        eg    2016-2024     <NA> <NA>
#> 19609     3000  1515060              45 3000 Bern                                          Bern         Apartment        eg    2016-2024     <NA> <NA>
#> 19610     3000   820000              55 3000 Bern                                          Bern         Apartment        noteg 2016-2024     <NA> <NA>
#> 19611     3000  1222680              35 3000 Bern                                          Bern         Duplex           noteg 2016-2024     <NA> <NA>
#> 19612     3000  1590000              55 3000 Bern                                          Bern         Apartment        noteg 1991-2000     <NA> <NA>
#> 19613     3000  1448610              45 3000 Bern                                          Bern         Roof flat        noteg 2016-2024     <NA> <NA>
#> 27169     4000  2100000              65 4000 Basel                                         Basel-Stadt  Villa                  2016-2024     <NA> <NA>
#> 27170     4000   975000              45 4000 Basel                                         Basel-Stadt  Single house           2016-2024     <NA> <NA>
#> 30708     5201   963520              45 5201 Brugg AG                                      Aargau       Apartment        noteg 2016-2024     <NA> <NA>
#> 33490     6511   584760               3 6511 Cadenazzo                                     Ticino       Apartment        noteg 2016-2024     <NA> <NA>
#> 33927     6547 19935000              55 Augio 1F, 6547 Augio                               Grisons      Single house           2016-2024     <NA> <NA>
#> 35207     6602   270000              15 6602 Muralto                                       Ticino       Apartment        eg    1961-1970     <NA> <NA>
#> 35208     6602  3721200              55 6602 Muralto                                       Ticino       Single house           1981-1990     <NA> <NA>
#> 35209     6602  3721200              55 6602 Muralto                                       Ticino       Single house           1981-1990     <NA> <NA>
#> 35210     6604  2644710              35 6604 Solduno                                       Ticino       Attic flat       noteg 2011-2015     <NA> <NA>
#> 35211     6604  2644710              35 6604 Solduno                                       Ticino       Apartment        noteg 2011-2015     <NA> <NA>
#> 35212     6604  1142940              45 6604 Solduno                                       Ticino       Apartment        noteg 2016-2024     <NA> <NA>
#> 35213     6604   610000              25 6604 Solduno                                       Ticino       Apartment        noteg 2016-2024     <NA> <NA>
#> 35214     6604   810690              35 6604 Solduno                                       Ticino       Apartment        noteg 2016-2024     <NA> <NA>
#> 35215     6604   860000              35 6604 Solduno                                       Ticino       Apartment        noteg 2016-2024     <NA> <NA>
#> 35216     6604   917010              45 6604 Locarno                                       Ticino       Apartment        noteg 2011-2015     <NA> <NA>
#> 35217     6604  1010040              45 6604 Locarno                                       Ticino       Apartment        noteg 2011-2015     <NA> <NA>
#> 40817     6901  3628170              45 6901 Lugano                                        Ticino       Attic flat       noteg 2011-2015     <NA> <NA>
#> 40818     6911   877140              55 6911 Campione d'Italia                             Ticino       Single house           1971-1980     <NA> <NA>
#> 40819     6911   810690              45 6911 Campione d'Italia                             Ticino       Apartment        eg    1946-1960     <NA> <NA>
#> 40820     6911   730950              45 6911 Campione d'Italia                             Ticino       Apartment        noteg 1991-2000     <NA> <NA>
#> 40821     6911   465150              35 6911 Campione d'Italia                             Ticino       Apartment        noteg 1946-1960     <NA> <NA>
#> 42848     7133  2246010              35 Inder Platenga 34, 7133 Obersaxen                  Grisons      Single house           2006-2010     <NA> <NA>
#> 42861     7135  3575010              65 7135 Fideris                                       Grisons      Single house           0-1919        <NA> <NA>
#> 43231     8000  1295770              35 8000 Zürich                                        Zurich       Single house           2016-2024     <NA> <NA>
#> 43232     8000  2100000              45 8000 Zürich                                        Zurich       Apartment        noteg 2016-2024     <NA> <NA>
#> 43233     8000  2495000              55 8000 Zürich                                        Zurich       Apartment        noteg 0-1919        <NA> <NA>
#> 44144     8238   739000              35 8238 Büsingen am Hochrhein                         Schaffhausen Apartment        eg    2016-2024     <NA> <NA>
#> 44145     8238   739000              35 8238 Büsingen am Hochrhein                         Schaffhausen Attic flat       eg    2016-2024     <NA> <NA>
#> 44146     8238   716000              35 Junkerstrasse 85, 8238 Büsingen am Hochrhein       Schaffhausen Attic flat       noteg 2016-2024     <NA> <NA>
#> 44147     8238   716000              35 Junkerstrasse 85, 8238 Büsingen am Hochrhein       Schaffhausen Apartment        noteg 2016-2024     <NA> <NA>
#> 44148     8238   325600               3 Stemmerstrasse 14, 8238 Büsingen am Hochrhein      Schaffhausen Apartment        noteg 1961-1970     <NA> <NA>
#> 44889     8423  2910510              45 Chüngstrasse 48, 8423 Embrach                      Zurich       Single house           2016-2024     <NA> <NA>
#> 44890     8423  2804190              55 Chüngstrasse 60, 8423 Embrach                      Zurich       Bifamiliar house       2016-2024     <NA> <NA>
#> 47001     9002  3787650              45 6900 Lugano 2 Paradiso Caselle                     Ticino       Apartment        noteg 2011-2015     <NA> <NA>
#> 47621     9241   724300              35 9241 Kradolf                                       Thurgau      Apartment        noteg 1991-2000     <NA> <NA>
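The merged data contain 144 NA values: each of the 72 rows printed above is missing both City and Canton_code. This happens in two cases:
- The zip code was not found in the AMTOVZ data frame (for example 1919, taken from "1919 Martigny").
- The zip code was incorrectly isolated from the address (for example 25, taken from "1000 Lausanne 25").
These rows were removed.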
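The two failure cases above can be illustrated with a minimal Python sketch. It is not the project's R code: the rule "take the last standalone 4-digit number in the address" is an assumption chosen for illustration, `AMTOVZ` is a hypothetical three-row excerpt of the lookup table, and the last address in the sample is invented so that one row finds a match.

```python
import re

# Hypothetical excerpt of the AMTOVZ lookup: zip_code -> (City, Canton_code).
# The real table has many more rows; these three come from the head() output.
AMTOVZ = {
    "8914": ("Aeugst am Albis", "ZH"),
    "8909": ("Zwillikon", "ZH"),
    "8910": ("Affoltern am Albis", "ZH"),
}

def extract_zip(address):
    """Return the last standalone 4-digit number in the address, or None."""
    matches = re.findall(r"\b\d{4}\b", address)
    return matches[-1] if matches else None

def merge_with_lookup(addresses, lookup):
    """Left-join each address with the lookup; unmatched rows get None."""
    rows = []
    for address in addresses:
        zip_code = extract_zip(address)
        city, canton_code = lookup.get(zip_code, (None, None))
        rows.append({"address": address, "zip_code": zip_code,
                     "City": city, "Canton_code": canton_code})
    return rows

addresses = [
    "1000 Lausanne 25",                                    # trailing district number
    "Via Cuolm Liung 30d, 7032 Laax GR 2",                 # street number and suffix
    "Chemin des pralets, 74100 Etrembières, 1200 Genève",  # 5-digit foreign code
    "8910 Affoltern am Albis",                             # invented address that matches
]
merged = merge_with_lookup(addresses, AMTOVZ)

# Rows to inspect or drop: the zip code was not found in the lookup.
unmatched = [row for row in merged if row["City"] is None]
```

Requiring exactly four digits skips the trailing "25" and the 5-digit "74100", so the first kind of error is avoided here; a zip code that is simply absent from the lookup still ends up in `unmatched`.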
3.3 Tax data cleaning
3.3.1 Merging the two datasets
This merged dataset is used for the rest of the analysis.
3.4 Cleaning of commune data
4 EDA
4.1 Change the path below
4.2 Histogram of prices
4.3 Histogram of prices for each property type
Note: only prices between 0 and 500000 are shown, so some outliers do not appear.
4.4 Histogram of prices for each year category
Note: only prices between 0 and 500000 are shown, so some outliers do not appear.
4.5 Histogram of prices for each canton
Note: only prices between 0 and 500000 are shown, so some outliers do not appear.
4.6 Histogram of prices for each number of rooms
Note: only prices between 0 and 500000 are shown, so some outliers do not appear.
The graph below also shows only apartments with fewer than 10 rooms (the code can be changed if needed).
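The filtering described in these notes can be sketched in Python as follows. This is an illustration rather than the plotting code used for the figures: the function names, the inclusive price bounds, the bin width and the sample rows are all assumptions.

```python
def filter_for_plot(rows, max_price=500_000, max_rooms=None):
    """Keep rows with 0 <= price <= max_price; optionally also rooms < max_rooms."""
    kept = [r for r in rows if 0 <= r["price"] <= max_price]
    if max_rooms is not None:
        kept = [r for r in kept if r["number_of_rooms"] < max_rooms]
    return kept

def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = {}
    for value in values:
        edge = int(value // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical rows, for illustration only.
rows = [
    {"price": 230_000, "number_of_rooms": 1.5},
    {"price": 420_000, "number_of_rooms": 4.5},
    {"price": 488_000, "number_of_rooms": 3.5},
    {"price": 1_800_000, "number_of_rooms": 6.5},  # dropped: price above the cap
    {"price": 325_600, "number_of_rooms": 12},     # dropped: 10 rooms or more
]

kept = filter_for_plot(rows, max_rooms=10)
bins = histogram([r["price"] for r in kept], bin_width=100_000)
```

Making the caps function arguments keeps the cut-offs visible and easy to change, which matters because rows above them are silently absent from the plots.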
4.7 Test Regression
#>
#> Call:
#> lm(formula = price ~ number_of_rooms + canton + property_type +
#> year_category, data = properties)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7013788 -514438 -138948 264464 21628996
#>
#> Coefficients:
#>                               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                    -677158      55739  -12.15  < 2e-16 ***
#> number_of_rooms                 337946       6166   54.81  < 2e-16 ***
#> cantonappenzell-ausser-rhoden  -464945     126861   -3.66  0.00025 ***
#> cantonappenzell-inner-rhoden   -874289     392590   -2.23  0.02596 *
#> cantonbasel-landschaft         -195701      57943   -3.38  0.00073 ***
#> cantonbasel-stadt               218682     105130    2.08  0.03753 *
#> cantonbern                     -478376      46221  -10.35  < 2e-16 ***
#> cantonfribourg                 -781416      48366  -16.16  < 2e-16 ***
#> cantongeneva                   2025260      62234   32.54  < 2e-16 ***
#> cantonglarus                   -573694     173301   -3.31  0.00093 ***
#> cantongrisons                    59982      71666    0.84  0.40262
#> cantonjura                     -801519      77323  -10.37  < 2e-16 ***
#> cantonlucerne                  -187979      73261   -2.57  0.01030 *
#> cantonneuchatel                -353635      65590   -5.39  7.1e-08 ***
#> cantonnidwalden                 991055     244826    4.05  5.2e-05 ***
#> cantonobwalden                  366062     244712    1.50  0.13470
#> cantonschaffhausen             -584997     120601   -4.85  1.2e-06 ***
#> cantonschwyz                     18070     132558    0.14  0.89157
#> cantonsolothurn                -784557      61024  -12.86  < 2e-16 ***
#> cantonst-gallen                -404890      55918   -7.24  4.6e-13 ***
#> cantonthurgau                   -37337      63444   -0.59  0.55620
#> cantonticino                    125913      38499    3.27  0.00108 **
#> cantonuri                         9578     155772    0.06  0.95097
#> cantonvalais                   -219964      39781   -5.53  3.3e-08 ***
#> cantonvaud                       89914      40258    2.23  0.02553 *
#> cantonzug                       801241     153896    5.21  1.9e-07 ***
#> cantonzurich                    316099      49688    6.36  2.0e-10 ***
#> property_typeAttic flat         311019      45964    6.77  1.4e-11 ***
#> property_typeBifamiliar house    41841      42939    0.97  0.32986
#> property_typeChalet            1136804      56690   20.05  < 2e-16 ***
#> property_typeDuplex              -5091      56699   -0.09  0.92846
#> property_typeFarm house         237939     118848    2.00  0.04529 *
#> property_typeLoft               285442     291977    0.98  0.32827
#> property_typeRoof flat            4801      64587    0.07  0.94074
#> property_typeRustic house      -281265     249068   -1.13  0.25880
#> property_typeSingle house       389066      24252   16.04  < 2e-16 ***
#> property_typeTerrace flat        88662      87071    1.02  0.30856
#> property_typeVilla             1278283      38187   33.47  < 2e-16 ***
#> year_category1919-1945           10462      61602    0.17  0.86515
#> year_category1946-1960           76025      57261    1.33  0.18429
#> year_category1961-1970          232055      48444    4.79  1.7e-06 ***
#> year_category1971-1980          210609      43422    4.85  1.2e-06 ***
#> year_category1981-1990          237789      43679    5.44  5.3e-08 ***
#> year_category1991-2000          477554      45385   10.52  < 2e-16 ***
#> year_category2001-2005          519338      55369    9.38  < 2e-16 ***
#> year_category2006-2010          591351      48030   12.31  < 2e-16 ***
#> year_category2011-2015          724194      47219   15.34  < 2e-16 ***
#> year_category2016-2024          641233      36926   17.37  < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1240000 on 21363 degrees of freedom
#> (72 observations deleted due to missingness)
#> Multiple R-squared: 0.323, Adjusted R-squared: 0.321
#> F-statistic: 216 on 47 and 21363 DF, p-value: <2e-16
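To read the coefficient table, it helps to compute one fitted value by hand. The Python sketch below copies a few estimates from the summary above and adds them up. It rests on two assumptions: `number_of_rooms` enters the model in units of rooms (for example 4.5), and the reference levels are the ones absent from the table (presumably Aargau, Apartment and 0-1919). The function name is hypothetical.

```python
# Estimates copied from the lm() summary above (only those used here).
COEF = {
    "(Intercept)": -677158,
    "number_of_rooms": 337946,
    "cantongeneva": 2025260,
    "property_typeVilla": 1278283,
    "year_category2016-2024": 641233,
}

def predict_price(rooms, canton=None, property_type=None, year_category=None):
    """Linear prediction; None selects the reference level (no dummy term)."""
    price = COEF["(Intercept)"] + COEF["number_of_rooms"] * rooms
    for prefix, level in (("canton", canton),
                          ("property_type", property_type),
                          ("year_category", year_category)):
        if level is not None:
            price += COEF[prefix + level]
    return price

# A 4.5-room villa in Geneva built in 2016-2024 versus the reference property.
villa_geneva = predict_price(4.5, "geneva", "Villa", "2016-2024")
reference = predict_price(4.5)
print(villa_geneva, reference)  # 4788375.0 843599.0
```

Each dummy coefficient is therefore a shift relative to the reference property, which is why the Geneva villa is predicted about 3.9 million above the baseline for the same number of rooms. With a residual standard error of 1240000 and an R-squared of 0.323, such point predictions remain very imprecise.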
5 Supervised learning
- Data splitting (if a training/test set split is enough for the global analysis, at least one CV or bootstrap must be used)
- Two or more models
- Two or more scores
- Tuning of one or more hyperparameters per model
- Interpretation of the model(s)
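As a starting point for the splitting and scoring requirements listed above, here is a minimal Python sketch of k-fold cross-validation with two scores (RMSE and MAE). It uses only the standard library and a mean-only baseline in place of a real model; the function names, the fold construction and the sample prices are assumptions for illustration, not the project's chosen setup.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_validate(y, k=5):
    """Score a mean-only baseline with k-fold CV; returns (RMSE, MAE) per fold."""
    scores = []
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        test = [y[i] for i in fold]
        baseline = sum(train) / len(train)  # fit: mean of the training prices
        preds = [baseline] * len(test)      # predict the same value everywhere
        scores.append((rmse(test, preds), mae(test, preds)))
    return scores

# Illustrative prices only.
prices = [488_000, 650_000, 870_000, 975_000, 1_050_000,
          1_100_000, 1_490_000, 1_755_000, 1_800_000, 1_980_000]
fold_scores = cross_validate(prices, k=5)
```

Any real model can replace the baseline inside the loop, as long as it is fitted on `train` only; the per-fold scores are then averaged to compare models or hyperparameter values.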
6 Unsupervised learning
- Clustering and/or dimension reduction
7 Conclusion
- Brief summary of the project
- Take home message
- Limitations
- Future work?